Linking the Center for International Blood and Marrow Transplant Research Registry to the California Cancer Registry and California Hospital Patient Discharge Data

Advances in hematopoietic cell transplantation (HCT) have substantially improved patient survival, increasing the importance of studying outcomes and long-term adverse effects in the rapidly growing population of HCT survivors. Large-scale registry data from the Center for International Blood and Marrow Transplant Research (CIBMTR) are a valuable resource for studying mortality and late effects after HCT, providing detailed data reported by HCT centers on transplantation-related factors and key outcomes. This study was conducted to evaluate the robustness of CIBMTR outcome data and assess health-related outcomes and healthcare utilization among HCT recipients. We linked data from the CIBMTR for California residents with data from the population-based California Cancer Registry (CCR) and hospitalization information from the California Patient Discharge Database (PDD). In this retrospective cohort study, probabilistic and deterministic record linkage used key patient identifiers, such as Social Security number, ZIP code, sex, birth date, hematologic malignancy type and diagnosis date, and HCT type and date. Among 22,733 patients registered with the CIBMTR who underwent autologous or allogeneic HCT for hematologic malignancy between 1991 and 2016, 89.0% were matched to the CCR and/or PDD (n = 17,707 [77.9%] for both, n = 1179 [5.2%] for the CCR only, and n = 1342 [5.9%] for the PDD only). Unmatched patients were slightly more likely to have undergone a first autologous HCT than an allogeneic HCT (12.6% versus 9.0%), to have a larger number of missing linkage identifiers, and to have undergone HCT prior to 2010. Among the patients reported to the CIBMTR who matched to the CCR, 85.7% demonstrated concordance of both hematologic malignancy type and diagnosis date across data sources. This linkage presents unparalleled opportunities to advance our understanding of HCT practices and patient outcomes.


INTRODUCTION
The application and curative potential of hematopoietic cell transplantation (HCT) for the treatment of hematologic malignancies, hereditary blood disorders, and certain autoimmune diseases has increased over the past several decades [1]. The annual number of HCTs performed in the United States has more than quadrupled since the early 1990s, with approximately 8900 allogeneic HCT recipients and 14,000 autologous HCT recipients in 2018 [2,3]. Improved survival and expansion of patient eligibility, particularly in older individuals, is predicted to result in dramatic growth of the population of US HCT survivors, from an estimated >100,000 in 2009 to 500,000 by 2030 [1,4].
The expansion of HCT and growing awareness of unique survivorship needs of recipients have increased the importance of studying outcomes and adverse effects in HCT survivors. Previous research has demonstrated that HCT recipients are at risk of developing a range of short-term and long-term complications, including subsequent malignancies, cardiovascular disease, infections, and premature mortality [5][6][7][8][9][10][11][12]. However, large numbers of patients with systematic, long-term follow-up from a variety of transplantation centers are needed to fully describe these adverse outcomes, which have been shown to vary substantially based on age and other patient demographic and clinical factors [6,8,10,13].
The largest and most comprehensive HCT registry in the United States is maintained by the Center for International Blood and Marrow Transplant Research (CIBMTR), a research collaboration between the National Marrow Donor Program/Be The Match and the Medical College of Wisconsin that systematically collects extensive HCT and outcome data for more than 600,000 patients from more than 500 transplantation centers internationally [14]. The CIBMTR holds the contract for the Stem Cell Therapeutic Outcomes Database, which was awarded in 2006 and mandates reporting of all allogeneic HCTs performed in the United States [2]. Reporting is voluntary for autologous HCTs, but an estimated 85% of these HCTs are captured [2]; however, the completeness and accuracy of HCT center-reported data, particularly for long-term outcomes, have not been systematically compared with other administrative or population-based data sources.
To evaluate the robustness of long-term outcome data in the CIBMTR registry and assess health-related outcomes and healthcare utilization among HCT recipients, we linked the CIBMTR database for California residents with the population-based California Cancer Registry (CCR) and hospitalization information from the California Department of Health Care Access and Information Patient Discharge Database (PDD) for all patients diagnosed with a hematologic malignancy between 1991 and 2016. We selected these databases for linkage because of their large scale and high data quality. Specifically, 11% of all US HCT recipients underwent transplantation in California in 2016 to 2020, the CCR has been estimated to capture >98% of all cancer diagnoses occurring in California residents, and the PDD captures all inpatient hospitalizations from nonfederal hospitals across the state [3,15,16]. In this article, we describe our linkage methodology, compare characteristics of the study populations by match status, and evaluate the concordance of HCT indications (ie, hematologic malignancy type) and HCT types among data sources.

CIBMTR
Since 2007, participating CIBMTR institutions have been required to report data from all consecutive allogeneic HCT procedures on standardized data collection forms completed by data managers and submitted at the time of HCT and at 100 days, 6 months, and annually post-HCT until year 6 and biennially thereafter until death [2,14,17]. Most centers also voluntarily report autologous HCT procedures on the same forms [14]. The CIBMTR collects data at 2 levels: Transplant Essential Data (TED) and Comprehensive Report Form (CRF) levels. The TED dataset is an internationally accepted standard dataset that contains a limited number of key variables for all consecutive HCT recipients. TED-level data, with some additional details of donor and graft characteristics, comprise the obligatory data submitted to the Stem Cell Therapeutic Outcomes Database. When an HCT is registered with the CIBMTR, a subset of patients is selected for the CRF level of data collection through a weighted randomization scheme. The CRF captures additional patient-, disease-, and treatment-related data. The broad range of information reported to the CIBMTR includes patient demographics, clinical characteristics, HCT details, transplantation center, and key clinical outcomes, including relapse, acute and chronic graft-versus-host disease, common HCT complications, vital status and cause of death, and selected late effects for patients on the CRF data collection track) [14]. Data compliance and quality are monitored by onsite audits, computerized checks for discrepancies, and physicians' review of submitted data. Studies are performed in compliance with all applicable federal regulations pertaining to the protection of human research participants. Patients and/or guardian(s) provide written informed consent for data submission and research participation.
Within the CIBMTR database, patients eligible for the linkage underwent allogeneic or autologous HCT for a hematologic malignancy (Table 1) diagnosed between January 1, 1991, and December 31, 2016, consented for research, and were California residents based on ZIP code at the time of HCT or transplantation at a California HCT center if residential ZIP code was missing. For subsequent analyses, the hematologic malignancy type that was the indication for HCT was categorized according to the World Health Organization 's Classification of Tumors of the Hematopoietic and Lymphoid Tissues, 4th  Edition (Supplementary Table S1A) [18]. HCT centers were characterized by volume based on the total number of allogeneic HCTs performed at each center over the entire study period, grouped into tertiles (low, <140; medium, 140-459; high, ≥460) to investigate the relationship of HCT volume with data completeness and quality.

CCR
The CCR is California's population-based cancer surveillance system since 1988, collecting cancer incidence on approximately 99% of new cancer cases (www.ccrcal.org) and harmonizing data from the regional cancer registries within the state. Information is collected on patient sociodemographics (eg, age at diagnosis, sex, race/ethnicity, neighborhood socioeconomic status, health insurance at diagnosis/initial treatment), tumor characteristics (eg, date of diagnosis and tumor site, histology, and stage at diagnosis), initial treatment (eg, chemotherapy/immunotherapy, radiotherapy, surgery), and HCT (data collection required for patients diagnosed from 2003 forward) [15,19]. The CCR also obtains vital status, last known follow-up date, and underlying cause of death captured on the death certificate through hospital follow-up and linkages to state and national vital statistics and other databases. In addition, the CCR has established linkage mechanisms with LexisNexis to obtain updated address and determine whether patients have emigrated from the state.
Within the CCR, patients eligible for linkage were diagnosed with a hematologic malignancy (Table 1) Table S1B) [18].

PDD
Since 1991, the California Department of Health Care Access and Information has mandated reporting of diagnostic and procedure codes on all inpatient hospitalization admissions in California from nonfederal hospitals across the state into the PDD (https://hcai.ca.gov/). For all hospital admissions, the PDD data provides patient demographics (eg, age, sex, race/ethnicity, ZIP code of residence, health insurance), hospital, and clinical information (eg, admission and discharge date, principal diagnosis and up to 24 secondary diagnoses, principal procedure and up to 20 secondary procedures). Diagnoses and procedures were coded based on the International Classification of Diseases, Ninth (ICD-9) and Tenth (ICD-10) Revisions.

Linkage
The protocol for this study was approved by the Institutional Review Boards of the University of California Davis, the California Committee for the Protection of Human Subjects, and the National Marrow Donor Program and was determined to not be human subjects research by the National Cancer Institute. We undertook the linkage of the CIBMTR, CCR, and PDD datasets in multiple stages (Figure 1), relying on a combination of the following 9 linkage identifiers: date of birth, sex, Social Security number (SSN), zip code of residence, date and type of hematologic malignancy diagnosis, date and type of HCT, and transplantation center. Table 2 provides completeness of each variable by  data source, and Table 3 specifies the combinations of identifiers that enabled a match. Most linkage identifiers were very complete (>90%) across all data sources except for SSN (available for 11.2% of CIBMTR patients) and ZIP code of residence (available for 39% of CIBMTR patients) ( Table 2).
In the first step, we linked the CCR and PDD admissions data using the CCR's deterministic and probabilistic algorithms [20] routinely used for research ( Figure 1). We then linked all CCR patients with a hematologic malignancy (regardless of linkage status to the PDD) to the CIBMTR. Finally, we linked CIBMTR patients who did not match to the CCR directly to the PDD after restricting the PDD to admissions that had HCT procedure codes. All data elements except hematologic malignancy type and date of diagnosis, which are not available in the PPD, were used in this final linkage.
The National Cancer Institute's Match*Pro v1.6.1 software [21] was used to conduct the linkages using a probabilistic approach based on the Fellegi and Sunter model [22]. Blocking sensitivity was set to 2, so that records from the data sources were compared if at least 2 blocking variables (SSN, birth date, hematologic malignancy diagnosis date, HCT date, ZIP code of residence, hematologic malignancy type [defined for linkage purposes as leukemia/preleukemia, myeloma, and lymphoma]) were equivalent. Matching variables were used to calculate the m probability (ie, the probability that the 2 fields agree for a matching pair) and u probability (ie, the probability that 2 fields agree for a nonmatching pair) for each field. The match classification filter in Match*Pro was used to customize the classification of matches to our matching criteria in Table 3.
Uncertain matches, many-to-one matches, and nonmatches were manually reviewed, with data for linkage identifiers systematically compared to determine agreement. Manual review was applied to determine agreement between facility names, as names were not standardized within the CIBMTR or across data sources. Facilities were considered to be in agreement if names were in a different order or an acronym was used. In addition, race/ethnicity data from the CIBMTR and CCR were used only for manual review, as all of Match*Pro's blocking and matching parameters were maxed out with the other parameters.

Statistical Analyses
We evaluated the relationship of patient-, HCT-, and HCT center-related factors with match status (unmatched or matched to the PDD only versus matched to the CCR with or without the PDD). Data are presented at the patient level of the first HCT by HCT type (allogeneic versus autologous). In addition to descriptive statistics (n, frequency), we constructed polytomous logistic regression models to identify key predictors of match status, first evaluating univariable results and then multivariable results because of correlation among different factors (eg, indication for and age at HCT). Multivariable models included linkage variables, HCT center volume, HCT year, sex, race/ethnicity, age at HCT, and hematologic malignancy type. Results are presented as adjusted odds ratios (ORs) and associated 95% confidence intervals (CIs).
We determined the concordance of broad hematologic malignancy type and subtype between the CIBMTR and CCR. We also determined the concordance of HCT type (allogeneic and autologous) between the CIBMTR and the PDD and CCR. Hematologic malignancy diagnosis and HCT procedure dates were compared using ≤90-day and >90-day thresholds. Finally, we compared the concordance of HCT centers between the CIBMTR and the PDD. Analyses were conducted using SAS version 9.4 (SAS Institute, Cary, NC).

RESULTS
In a multistep linkage (Figure 1), we attempted to match 22,733 patients in the CIBMTR who underwent 1 or more autologous or allogeneic HCTs (n = 24,938 transplantations) for a hematologic malignancy diagnosed between 1991 and 2016 to data from the CCR and the PDD. A total of 20,228 (89.0%) matched, including 17,707 (77.9%) matched to both datasets, 1179 (5.2%) matched to CCR only, and 1342 (5.9%) matched to PDD only. A total of 2505 (11%) CIBMTR patients remained unmatched. Figure 2 presents the availability of linkage variables and characteristics of CIBMTR patients by their matching status. CIBMTR patients who remained unmatched to either CCR or PDD were slightly more likely to have undergone a first autologous HCT (12.6%) rather than a first allogeneic HCT (9.0%). In multivariable models, regardless of HCT type, patients who remained unmatched were more likely to have incomplete data for the linkage variables and to have other/unknown race/ethnicity in the CIBMTR (Table 4). In addition, among patients who underwent allogeneic HCT, those who remained unmatched were most likely to have undergone transplantation between 2001 and 2010 and from a low-volume center or center outside California, to be age ≥40 years at receipt of HCT, and to have undergone transplantation for chronic lymphocytic leukemia/small lymphocytic lymphoma, myelodysplastic syndrome (MDS), or other myeloproliferative neoplasm (MPN). Among autologous HCT recipients, patients who remained unmatched were most likely to have undergone transplantation between 1991 and 2005 and from a medium-volume center or a center outside California, and to have undergone HCT for a plasma cell neoplasm. Results generally were similar for patients who matched to the PDD only, except they also were more likely to have undergone HCT at a younger age (<40 years) ( Figure 2, Table 4).

Concordance of Hematologic Malignancy Type between the CIBMTR and CCR
We compared the hematologic malignancy diagnoses from the CCR and each HCT indication in the CIBMTR. Because of differences in the specificity of the information and changes in the classification of hematologic malignancies over the last several decades, we considered concordance by broader type (eg, Hodgkin lymphoma) as well as by subtype (eg, nodular sclerosis classical Hodgkin lymphoma) (Supplementary Table S1A, B). The CIBMTR and CCR were concordant for both hematologic malignancy subtype and diagnosis date for 75.6% of HCTs and for broad hematologic malignancy type and date for an additional 10.1% of HCTs, generally because of one source having a more specific diagnosis and the other having a consistent but less specific diagnosis (Figure 3). An additional 7.1% of HCTs were concordant by hematologic malignancy type or subtype, but not by date (diagnosis date >90 days different). For 3.0% of patients, hematologic malignancy types were considered likely matches because they were consistent with known changes in diagnostic criteria (eg, MDS versus acute myelogenous leukemia [AML] occurring within 90 days, owing to a change in the blast count threshold), composite diagnoses (eg, follicular and diffuse large B cell lymphoma occurring within 90 days), or transformations (eg, MDS or MPN to AML, follicular lymphoma or chronic lymphocytic leukemia/small lymphocytic lymphoma to diffuse large B cell lymphoma with diagnosis dates >90 days apart).

HCT Concordance of CIBMTR HCTs with PDD and CCR Data
Overall, 77.4% of HCTs, including 69.9% of allogeneic HCTs and 84.3% of autologous HCTs, from the CIBMTR were found in the PDD. Concordance of CIBMTR HCT data with the PDD in terms of both HCT type and date occurred in 96.6% of HCTs (94.9% allogeneic, 97.8% autologous) ( Figure 4A). In addition, 98.1% had concordance between reported HCT center in the PDD and the CIBMTR (data not shown). A lower proportion (49.4%) of HCTs from the CIBMTR had HCT recorded as part of their initial treatment in the CCR. For those with an HCT noted in the CCR, concordance in terms of both date and type of HCT was 63.5% ( Figure 4B).

DISCUSSION
In this large data linkage of HCTs performed in California over nearly 3 decades, we successfully matched 89% of patients reported to the CIBMTR to the CCR and/or PDD, which is comparable to or higher than similar linkages, including the linkage of CIBMTR data to Medicare administrative claims data [23] and linkages of state cancer registries with administrative claims data [24,25]. The likelihood of successful matching was significantly reduced if complete identifiers were missing and/or if HCT occurred prior to 2010; hematologic malignancy type and HCT center characteristics also were associated with the likelihood of matching. However, demographic data were generally similar in matched and unmatched patients, suggesting that our linkage was representative of the CIBMTR HCT population in California. The high matching rate in our study allows for the use of these complementary data sources in a range of analyses including investigations into HCT access, healthcare resource utilization, late effects (eg, cardiovascular disease, subsequent malignancies), and cause-specific mortality.
As expected, individuals with fewer reported linkage variables were less likely to match with the CCR. The impact of missing linkage data was observed across individual linkage variables, suggesting that each variable was valuable in the linkage process. A known challenge in performing high-quality linkages is the lack of personal identifiers [26], and our linkage lacked or had limited availability of several key identifiers, including patient name and SSN for CIBMTR patients. In a prior study of 11,358 patients who underwent HCT in 2010 to 2012 in the Medicare claims data, 80% matched to the CIBMTR, even though SSN was missing for 64% [23]. In the present study, although a higher portion of patients with an SSN matched (96%), 71% of patients without an SSN matched with other identifiers, including date of birth, sex, HCT date, and HCT type [23]. In addition, a linkage of the Utah cancer registry with all-payer claims data (82.4% overall linkage) found relatively high matching success without the SSN when other linkage variables were known [25].
Because ZIP code of residence was not available for the majority of CIBMTR patients, it is likely that most of the 11% of patients who did not link to the CCR were not California residents at the time of hematologic malignancy diagnosis. As the CCR only ascertains reportable cancer among those who live in California at the time of diagnosis, out-of-state residents who underwent HCT in California would not match in the CCR. In addition, certain variables used in the linkage were not standardized, such as the name of the reporting transplantation center, resulting in the need for a manual review. Despite these challenges, our process resulted in a high matching rate that approached or surpassed previous large data linkages [24,25]. Improving the completeness of the key identifiers used in this study and including patient names will facilitate high-quality linkages in the future.
We found that the likelihood of successful linkage improved over time, commensurate with the implementation of reporting requirements for allogeneic HCT. In addition to calendar year, the ability to successfully link across data sources varied by hematologic malignancy type and HCT center characteristics. For example, linkage of cases with MDS and MPN was less successful than linkage with AML, which was anticipated given that MDS/MPN were not reportable to the CCR until 2001. For autologous HCT, linkage of multiple myeloma was less successful than other malignancies, even after adjusting for the availability of linkage variables and other patient characteristics. This may be due to performing autologous HCT in the outpatient setting (limiting linkage to the PDD) or may be a consequence of the indolent clinical course in certain patients, allowing for emigration from the state of diagnosis. Finally, transplantation center volume was associated with linkage success in both autologous and allogeneic HCT recipients, with high-volume centers more likely to match, perhaps because of increased familiarity or staffing to conduct CIBMTR data collection and reporting.
Among patients reported to the CIBMTR who matched to the CCR and/or PDD, we observed a very high proportion concordant for hematologic malignancy type and date of diagnosis. In addition, for a small proportion of patients, we found that diagnostic criteria changes, composite diagnoses, and transformations likely impacted the concordance between data sources. We also found that 23% of HCTs from the CIBMTR were not included in the PDD, likely related to the PDD not capturing the HCTs performed in outpatient settings. Moreover, 51% of HCTs from the CIBMTR were not found in the CCR. The CCR findings may be related to HCTs being performed further out from hematologic malignancy diagnosis, given that cancer registries are mandated to capture first-course treatment within approximately 1 year of diagnosis, or underreporting of some cancer treatments, including HCT [15,[27][28][29]. Our findings highlight the importance of augmenting population-based cancer registry data with HCT data from other sources, including PDD, outpatient, or CIBMTR data, to fully capture its utilization.
Although the size and scope of our study are strengths, inconsistencies and inaccuracies in reporting are possible, as in any population-based registry study; however, our high rate of matching and the ability to examine sociodemographic and clinical characteristics across data sources are anticipated to result in improved data quality. The lack of personal identifiers was overcome through a robust approach to the linkage methods and the use of 3 different data sources, which increased the likelihood of matching to at least 1 of the data sources. Finally, the use of cancer registry and hospitalization databases from California might limit the generalizability of our results to other settings that may differ in racial/ ethnic and socioeconomic status composition or by referral and transplantation practices. Nonetheless, the linkage methods used in this study can be applied to other geographic regions with cancer registries and statewide hospitalization or all-payer claims databases.
In summary, we have demonstrated the successful linkage between the CIBMTR and 2 state-wide population-based data resources (CCR and PDD) to generate a unique database that can address a host of health outcomes, health services, and clinical research questions. Clinical HCT practices have evolved substantially in recent decades, particularly with the introduction of reduced-intensity conditioning regimens, reduced dependency on radiationbased conditioning, growing use of alternative stem cell donor sources, and expanding approaches for the diagnosis and management of graft-versus-host disease. How these advances have affected access to HCT for patients in need is poorly understood, as few studies contain large populations of both transplantation recipients and non-recipients with hematologic malignancies. In addition, the impact of these changes on long-term adverse effects that may occur many years after HCT (eg, total body irradiation-related subsequent malignancies) is not well understood. Data linkages such as ours that capitalize on the strengths of multiple complementary data sources provide an opportunity to capture information related to cancer diagnosis, treatment, and outcomes that can address these research gaps and serve as a source for emerging questions in cellular therapy.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.  Linkage variable completeness and characteristics of patients (%) who underwent HCT and are registered with the CIBMTR by HCT type and match status. CA, California; NH, non-Hispanic. Concordance between CIBMTR indication for HCT and CCR hematologic malignancy type among CIBMTR patients who matched to the CCR. Includes 20,820 CIBMTR-registered HCTs matched to a CCR hematologic cancer. Does not include patients who did not match to the CCR (n = 1447 to the PDD only; n = 2505 unmatched). Broad malignancy type consistent includes one source having a more specific diagnosis and the other having a consistent but less specific diagnosis. Likely match includes known changes in diagnostic criteria, composite diagnoses, and transformations.      * Adjusted for all variables in the table; patients with unknown sex (n = 32) or unknown age (n = 292) were excluded from the models.